I have a dataset (Prime_Vul) for fine-tuning.
The data format (the desired input and output) is as follows:
JSON
{
  "idx": 201849,
  "input": "void sqlite3Pragma( Parse *pParse, Token *pId1, …rest of the code for input }",
  "output": {
    "is_vulnerable": true,
    "vulnerability_types": ["Improper Check for Unusual or Exceptional Conditions"],
    "explanation": "pragma.c in SQLite through 3.30.1 mishandles NOT NULL in an integrity_check PRAGMA command in certain cases of generated columns.",
    "severity_level": "NoInfo",
    "cwe": ["CWE-754"],
    "cve": "CVE-2019-19646"
  },
  "code_token_length": 19000,
  "total_token_length": 20042,
  "max_tokens_setting": 32768
}
My question is: if I want to tokenize this dataset and then use it for fine-tuning, which parts exactly should I tokenize?
All of the data, or just the input?
Please guide me.
You need to tokenize everything that the model will see and learn to predict. The output text must also be tokenized so that it can be fed to the model's decoder as the target sequence during training and contribute to the loss.
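For example, here is a minimal sketch of what that can look like for a decoder-only (causal LM) setup with Hugging Face transformers. The checkpoint name, the prompt template, and the build_example helper are placeholders I am assuming, not something fixed by your dataset; the point is just that the prompt and the serialized output are tokenized together, and only the output tokens keep real labels:

import json
from transformers import AutoTokenizer

# Placeholder checkpoint; use whatever base model you are fine-tuning.
tokenizer = AutoTokenizer.from_pretrained("your-base-model")

def build_example(record):
    # Prompt = the code to analyze; target = the serialized "output" object.
    prompt = ("Analyze the following function for vulnerabilities:\n"
              + record["input"] + "\nAnswer:\n")
    target = json.dumps(record["output"])

    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    target_ids = tokenizer(target, add_special_tokens=False)["input_ids"]

    # The model sees prompt + target, but only the target (and EOS) is scored:
    # -100 is the ignore index used by the transformers loss functions.
    # This assumes the tokenizer defines an EOS token.
    input_ids = prompt_ids + target_ids + [tokenizer.eos_token_id]
    labels = [-100] * len(prompt_ids) + target_ids + [tokenizer.eos_token_id]
    return {"input_ids": input_ids, "labels": labels}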
Thank you so much.
Yes, that helped a lot.
But now I have a few more questions; I would appreciate it if you could help me with these too.
1: Should I also tokenize the labels?
2: I should use padding, correct? So that in every batch of the dataset (each with a different maximum token length), the input sequences have the same length for every function the code receives during fine-tuning?
3: Should I also use padding on the output?
4: And if yes, should I pad the input and output separately, or as one sequence?
Again, thank you for the first reply.
I would tokenize the labels you actually need and drop the unnecessary ones. Tokenize both the prompt and the completion, then decide which tokens go where (i.e. which tokens should contribute to the loss). I would use padding, either to a fixed maximum length or to the longest sequence in the batch. And yes, the padded labels must match the shape of your padded inputs.
Something like this:

batch = tokenizer(
    prompts,
    completions,
    padding="longest",
    truncation=True,
    return_tensors="pt",
)
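One variant worth noting: for a causal language model it is common to concatenate the prompt and the completion into a single sequence rather than passing them as a pair, and then to mask the padding out of the loss with -100 so the label tensor has the same padded shape as the inputs. A sketch, continuing from the snippet above and assuming a Hugging Face tokenizer; the max_length value is taken from the max_tokens_setting in your dataset:

# Concatenate each prompt with its completion into one training sequence.
texts = [p + c for p, c in zip(prompts, completions)]

batch = tokenizer(
    texts,
    padding="longest",
    truncation=True,
    max_length=32768,        # matches the dataset's max_tokens_setting
    return_tensors="pt",
)

# Labels mirror the padded input_ids; padded positions are set to -100
# so they do not contribute to the loss.
labels = batch["input_ids"].clone()
labels[batch["attention_mask"] == 0] = -100
batch["labels"] = labels

If you also want the prompt tokens excluded from the loss (so the model is only trained to produce the completion), you would additionally set the label positions covering the prompt to -100, as in the earlier sketch.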